
Make MultiHeadAttention op return attention probabilities #23125

Closed

amancini-N wants to merge 2 commits into microsoft:main from amancini-N:attn-probs-mha

Conversation

@amancini-N
Contributor

Description

Add an additional optional output to the MultiHeadAttention op, allowing it to return the attention probabilities.
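
For context (assuming the standard scaled dot-product attention this op implements), the attention probabilities are the softmax-normalized scores computed per head:

$$
P = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)
$$

with shape (batch_size, num_heads, sequence_length, total_sequence_length), matching the scratch-buffer allocation discussed in the review below.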

Motivation and Context

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, ONNX Runtime Web CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline

@tianleiwu
Contributor

/azp run Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, orttraining-linux-gpu-ci-pipeline, onnxruntime-binary-size-checks-ci-pipeline, Big Models, Linux Android Emulator QNN CI Pipeline, Android CI Pipeline

@tianleiwu
Contributor

/azp run iOS CI Pipeline, ONNX Runtime React Native CI Pipeline, CoreML CI Pipeline, Linux DNNL CI Pipeline, Linux MIGraphX CI Pipeline, Linux ROCm CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 6 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 10 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

```cpp
T* attn_probs_data = nullptr;
if (attn_probs == nullptr) {
  size_t bytes = SafeInt<size_t>(batch_size) * num_heads_ * sequence_length * total_sequence_length * sizeof(T);
  attention_probs = allocator->Alloc(bytes);
```
@tianleiwu
Contributor · Dec 17, 2024

There is no need to allocate extra space if we do not output it. You can follow the handling of output_qk (the temporary result of q*k before softmax) in this function.

If we do not need to output both q*k and softmax(q*k), we can consolidate the two paths with a boolean flag indicating whether the output is taken before or after softmax.
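
A minimal sketch of the suggested handling, assuming `attn_probs` is the optional output `Tensor*` and mirroring the existing `output_qk` pattern (variable names are illustrative, not the PR's actual code):

```cpp
T* attn_probs_data = nullptr;
void* scratch = nullptr;
if (attn_probs != nullptr) {
  // Output requested: write softmax(q*k) straight into the kernel's output
  // buffer, so no extra allocation is needed.
  attn_probs_data = attn_probs->MutableData<T>();
} else {
  // Output absent: keep the probabilities in scratch space, as before.
  size_t bytes = SafeInt<size_t>(batch_size) * num_heads_ *
                 sequence_length * total_sequence_length * sizeof(T);
  scratch = allocator->Alloc(bytes);
  attn_probs_data = reinterpret_cast<T*>(scratch);
}
```

A single boolean flag, as suggested, could then decide whether this buffer receives the scores before softmax (the output_qk case) or after.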

"or present state for self attention value with shape (batch_size, num_heads, total_sequence_length, head_size)",
"T",
OpSchema::Optional)
.Output(3,

You will need to update the documents (you can find the updated documents in the artifacts of the Windows GPU Doc Gen CI Pipeline for this PR).
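
For illustration only, the completed schema entry might read as below; the output name and description are assumptions mirroring the kernel allocation above, not the PR's final wording:

```cpp
    .Output(3,
            "attention_probs",
            "attention probabilities with shape "
            "(batch_size, num_heads, sequence_length, total_sequence_length)",
            "T",
            OpSchema::Optional)
```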

```cpp
auto& key_shape = getInputShape(ctx, 1);
auto& key_seqlen_dim = key_shape.dim()[1];
auto& past_seqlen_dim = getInputShape(ctx, past_key_index).dim()[2];
if (key_seqlen_dim.has_dim_value() && past_seqlen_dim.has_dim_value()) {
```

Add a `!past_present_share_buffer` condition here.
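
A sketch of the requested guard, assuming a boolean `past_present_share_buffer` is in scope (hypothetical placement):

```cpp
// When past/present share a preallocated buffer, dim 2 of past_key is the
// maximum (buffer) length rather than the actual past length, so skip the
// total-length inference in that case.
if (!past_present_share_buffer &&
    key_seqlen_dim.has_dim_value() && past_seqlen_dim.has_dim_value()) {
  int64_t total_sequence_length =
      key_seqlen_dim.dim_value() + past_seqlen_dim.dim_value();
  // ... use total_sequence_length when setting the output shape ...
}
```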

@snnn
Contributor · commented Jul 3, 2025

This pull request has been automatically closed because it has merge conflicts and has been inactive for more than 30 days. Please rebase on the target branch and open a new PR.

@snnn closed this on Jul 3, 2025


Development

Successfully merging this pull request may close these issues.

MultiHeadAttention op shall return attention probabilities
